Abstract: This paper describes an approach to text-to-speech synthesis (TTS) based on hidden Markov models (HMMs). In the proposed approach,
speech spectral parameter sequences are generated directly from HMMs based on the maximum likelihood criterion. By
considering the relationship between static and dynamic features during parameter generation, smooth spectral sequences are
generated according to the statistics of static and dynamic parameters modelled by the HMMs, resulting in natural-sounding
speech. In this paper, first the algorithm for parameter generation is derived, and then the basic structure of an HMM-based
TTS system is described. Results of subjective experiments show the effectiveness of dynamic features.
I. Introduction

A hidden Markov model (HMM) is a finite state machine which generates a sequence of discrete-time observations.
At each time unit (frame), the HMM changes state according to a state transition probability distribution, and then generates an
observation $o_t$ at time $t$ according to the output probability distribution of the current state. Hence, the HMM is a doubly stochastic
random process model.

An N-state HMM is defined by a state transition probability distribution $A = \{a_{ij}\}_{i,j=1}^{N}$, an output probability
distribution $B = \{b_j(o)\}_{j=1}^{N}$, and an initial state probability distribution $\pi = \{\pi_i\}_{i=1}^{N}$. For convenience, the compact notation

$$\lambda = (A, B, \pi)$$

is used to indicate the parameter set of the model.

(a) A 3-state ergodic model (b) A 3-state left-to-right model

Figure 1. Examples of HMMs

Figure 1 shows examples of HMMs. Figure 1(a) shows a 3-state ergodic model, in which every state of the model can be
reached from every other state in a single step, and Figure 1(b) shows a 3-state left-to-right model, in which the
state index increases or stays the same as time increases. Generally, left-to-right HMMs are used to model speech parameter
sequences, since they can appropriately model signals whose properties change in a successive manner. The output probability
distribution $b_j(o_t)$ can be discrete or continuous depending on the observations. Usually, in a continuous distribution HMM
(CD-HMM), each output probability distribution is modelled by a mixture of multivariate Gaussian distributions as follows:

$$b_j(o) = \sum_{m=1}^{M} w_{jm} \, \mathcal{N}(o \mid \mu_{jm}, U_{jm})$$

where $M$ is the number of mixture components, and $w_{jm}$, $\mu_{jm}$, $U_{jm}$ are the weight, mean vector, and covariance matrix of mixture component $m$
of state $j$, respectively. A Gaussian distribution $\mathcal{N}(o \mid \mu_{jm}, U_{jm})$ is defined by

$$\mathcal{N}(o \mid \mu_{jm}, U_{jm}) = \frac{1}{\sqrt{(2\pi)^d \, |U_{jm}|}} \exp\left(-\frac{1}{2} (o - \mu_{jm})^{\top} U_{jm}^{-1} (o - \mu_{jm})\right)$$

where $d$ is the dimensionality of $o$. The mixture weights $w_{jm}$ satisfy the stochastic constraints

$$\sum_{m=1}^{M} w_{jm} = 1, \quad 1 \le j \le N$$

$$w_{jm} \ge 0, \quad 1 \le j \le N, \ 1 \le m \le M$$

so that the $b_j(o)$ are properly normalized, i.e.

$$\int b_j(o) \, do = 1, \quad 1 \le j \le N$$

When the observation vector $o$ is divided into $S$ independent data streams, i.e. $o = [o_1^{\top}, o_2^{\top}, \ldots, o_S^{\top}]^{\top}$, $b_j(o)$ is
formulated as a product of Gaussian mixture densities:

$$b_j(o) = \prod_{s=1}^{S} b_{js}(o_s) = \prod_{s=1}^{S} \left\{ \sum_{m=1}^{M_s} w_{jsm} \, \mathcal{N}(o_s \mid \mu_{jsm}, U_{jsm}) \right\}$$
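The mixture and multi-stream output probabilities can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: it assumes diagonal covariance matrices (so each covariance is passed as a list of variances), and the function names `diag_gaussian`, `mixture_density`, and `stream_density` are hypothetical.

```python
import math


def diag_gaussian(o, mean, var):
    """Density N(o | mean, diag(var)) of a Gaussian with diagonal covariance."""
    d = len(o)
    log_det = sum(math.log(v) for v in var)
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return math.exp(-0.5 * (d * math.log(2.0 * math.pi) + log_det + quad))


def mixture_density(o, weights, means, variances):
    """b_j(o) = sum over m of w_jm * N(o | mu_jm, U_jm) for one state j."""
    return sum(w * diag_gaussian(o, mu, var)
               for w, mu, var in zip(weights, means, variances))


def stream_density(streams, stream_mixtures):
    """Product over independent streams of the per-stream mixture densities.

    streams: list of observation vectors o_s, one per stream.
    stream_mixtures: list of (weights, means, variances) tuples, one per stream.
    """
    p = 1.0
    for o_s, (weights, means, variances) in zip(streams, stream_mixtures):
        p *= mixture_density(o_s, weights, means, variances)
    return p
```

With a single mixture component and a single stream, `stream_density` reduces to one Gaussian density, which is the configuration used for the models described in the Method section.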
Likelihood calculation 

When the state sequence is determined as $Q = (q_1, q_2, \ldots, q_T)$, the likelihood of generating an observation sequence
$O = (o_1, o_2, \ldots, o_T)$ is calculated by multiplying the state transition probabilities and output probabilities for each state:

$$P(O, Q \mid \lambda) = \prod_{t=1}^{T} a_{q_{t-1} q_t} \, b_{q_t}(o_t)$$

where $a_{q_0 j}$ denotes $\pi_j$. The likelihood of generating $O$ from the HMM $\lambda$ is calculated by summing $P(O, Q \mid \lambda)$ over all possible
state sequences:

$$P(O \mid \lambda) = \sum_{\text{all } Q} \prod_{t=1}^{T} a_{q_{t-1} q_t} \, b_{q_t}(o_t)$$

The likelihood in the above equation can be calculated efficiently using the forward and/or backward procedure.
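The forward and backward variables

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = i \mid \lambda)$$

$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda)$$

can be calculated inductively as follows.

1. Initialization

$$\alpha_1(i) = \pi_i \, b_i(o_1), \quad 1 \le i \le N$$

$$\beta_T(i) = 1, \quad 1 \le i \le N$$

2. Recursion

$$\alpha_{t+1}(i) = \left[ \sum_{j=1}^{N} \alpha_t(j) \, a_{ji} \right] b_i(o_{t+1}), \quad 1 \le i \le N, \ t = 1, \ldots, T-1$$

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j), \quad 1 \le i \le N, \ t = T-1, \ldots, 1$$

The total likelihood is then $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$. The single most likely state sequence $q^*$ and its likelihood $P^*$ are obtained
with the Viterbi algorithm, whose recursion has the same form as the forward recursion with the summation replaced by a
maximization. Writing $\delta_t(i)$ for the resulting variable and $\psi_t(i)$ for the maximizing previous state, its final steps are:

3. Termination

$$P^* = \max_{i} \left[ \delta_T(i) \right]$$

$$q_T^* = \arg\max_{i} \left[ \delta_T(i) \right]$$

4. Path backtracking

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1$$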
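The procedures above can be sketched as follows. This is an illustrative Python version under stated assumptions: the output probabilities are passed in as a precomputed table `B[t][i]` standing for $b_i(o_t)$, indices start at 0 instead of 1, and the function names and the small example model are hypothetical.

```python
def forward(pi, A, B):
    """Forward procedure. B[t][i] stands for b_i(o_t). Returns (alpha, P(O | lambda))."""
    N, T = len(pi), len(B)
    alpha = [[pi[i] * B[0][i] for i in range(N)]]              # initialization
    for t in range(1, T):                                      # recursion
        alpha.append([sum(alpha[t - 1][j] * A[j][i] for j in range(N)) * B[t][i]
                      for i in range(N)])
    return alpha, sum(alpha[-1])                               # total likelihood


def backward(A, B):
    """Backward procedure. Returns beta with beta[T-1][i] = 1."""
    N, T = len(A), len(B)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    return beta


def viterbi(pi, A, B):
    """Viterbi algorithm. Returns (P*, most likely state sequence)."""
    N, T = len(pi), len(B)
    delta = [[pi[i] * B[0][i] for i in range(N)]]
    psi = [[0] * N]
    for t in range(1, T):
        row, back = [], []
        for i in range(N):
            best = max(range(N), key=lambda j: delta[t - 1][j] * A[j][i])
            back.append(best)
            row.append(delta[t - 1][best] * A[best][i] * B[t][i])
        delta.append(row)
        psi.append(back)
    path = [max(range(N), key=lambda i: delta[-1][i])]         # termination
    for t in range(T - 1, 0, -1):                              # path backtracking
        path.append(psi[t][path[-1]])
    path.reverse()
    return max(delta[-1]), path


# Hypothetical 2-state left-to-right model observed for 3 frames.
pi = [1.0, 0.0]
A = [[0.6, 0.4], [0.0, 1.0]]
B = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
alpha, likelihood = forward(pi, A, B)
p_star, path = viterbi(pi, A, B)
```

The forward pass costs on the order of $N^2 T$ operations, whereas direct summation over all state sequences grows exponentially with $T$.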

Maximum Likelihood Estimation of HMM parameters

There is no known method to analytically obtain the model parameter set based on the maximum likelihood (ML)
criterion, that is, to obtain in closed form the $\lambda$ which maximizes the likelihood $P(O \mid \lambda)$ for a given observation
sequence $O$. Since this is a high-dimensional nonlinear optimization problem with a number of local maxima,
it is difficult to obtain the $\lambda$ which globally maximizes $P(O \mid \lambda)$. However, a parameter set which locally maximizes
$P(O \mid \lambda)$ can be obtained using an iterative procedure such as the expectation-maximization (EM) algorithm (often referred to
as the Baum-Welch algorithm), and the obtained parameter set will be a good estimate if a good initial estimate is provided.

In the following, the EM algorithm for the CD-HMM is described. The algorithm for the HMM with discrete output
distributions can also be derived in a straightforward manner.

Q-Function 

In the EM algorithm, an auxiliary function $Q(\lambda', \lambda)$ of the current parameter set $\lambda'$ and the new parameter set $\lambda$ is defined
as follows:

$$Q(\lambda', \lambda) = \sum_{Q} P(O, Q \mid \lambda') \log P(O, Q \mid \lambda)$$

Here, each mixture component is decomposed into a substate, and $Q$ is redefined as a substate sequence, i.e.

$$Q = ((q_1, s_1), (q_2, s_2), \ldots, (q_T, s_T))$$

where $(q_t, s_t)$ represents being in substate $s_t$ of state $q_t$ at time $t$.

At each iteration of the procedure, the current parameter set $\lambda'$ is replaced by the new parameter set $\lambda$ which maximizes $Q(\lambda', \lambda)$.
This iterative procedure can be proved to increase the likelihood $P(O \mid \lambda)$ monotonically and to converge to a certain critical
point, since it can be proved that the Q-function satisfies the following theorems.

Theorem 1

$$Q(\lambda', \lambda) \ge Q(\lambda', \lambda') \ \Rightarrow \ P(O \mid \lambda) \ge P(O \mid \lambda')$$

Theorem 2

The auxiliary function $Q(\lambda', \lambda)$ has a unique global maximum as a function of $\lambda$, and this maximum is the one and only critical point.

Theorem 3

A parameter set $\lambda$ is a critical point of the likelihood $P(O \mid \lambda)$ if and only if it is a critical point of the Q-function.

Maximization of the Q-Function 

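$\log P(O, Q \mid \lambda)$ can be written as

$$\log P(O, Q \mid \lambda) = \sum_{t=1}^{T} \log a_{q_{t-1} q_t} + \sum_{t=1}^{T} \log w_{q_t s_t} + \sum_{t=1}^{T} \log \mathcal{N}(o_t \mid \mu_{q_t s_t}, U_{q_t s_t}),$$

where $a_{q_0 q_1}$ denotes $\pi_{q_1}$. Hence, the Q-function can be written as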






International Journal of Modern Engineering Research (IJMER) 
www.ijmer.com Vol.3, Issue.4, Jul - Aug. 2013 pp-1894-1899 ISSN: 2249-6645 

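$$
\begin{aligned}
Q(\lambda', \lambda) ={} & \sum_{i=1}^{N} P(O, q_1 = i \mid \lambda') \log \pi_i \\
& + \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, q_t = i, q_{t+1} = j \mid \lambda') \log a_{ij} \\
& + \sum_{i=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(O, q_t = i, s_t = k \mid \lambda') \log w_{ik} \\
& + \sum_{i=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(O, q_t = i, s_t = k \mid \lambda') \log \mathcal{N}(o_t \mid \mu_{ik}, U_{ik})
\end{aligned}
$$

The parameter set $\lambda$ which maximizes the above equation subject to the stochastic constraints

$$\sum_{i=1}^{N} \pi_i = 1, \qquad \sum_{j=1}^{N} a_{ij} = 1, \qquad \sum_{k=1}^{M} w_{ik} = 1, \qquad 1 \le i \le N$$

can be derived using Lagrange multipliers or differential calculus. The solution is expressed in terms of the following
quantities, computed with the current parameter set $\lambda'$: the probability $\gamma_t(i)$ of being in state $i$ at time $t$, the
probability $\gamma_t(i, k)$ of being in substate $k$ of state $i$ at time $t$, and the probability $\xi_t(i, j)$ of being in state $i$ at time $t$
and in state $j$ at time $t+1$:

$$\gamma_t(i) = P(q_t = i \mid O, \lambda') = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)}$$

$$\gamma_t(i, k) = P(q_t = i, s_t = k \mid O, \lambda') = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)} \cdot \frac{w_{ik} \, \mathcal{N}(o_t \mid \mu_{ik}, U_{ik})}{\sum_{m=1}^{M} w_{im} \, \mathcal{N}(o_t \mid \mu_{im}, U_{im})}$$

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda') = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{l=1}^{N} \sum_{n=1}^{N} \alpha_t(l) \, a_{ln} \, b_n(o_{t+1}) \, \beta_{t+1}(n)}$$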
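The text names the quantities $\gamma_t(i)$, $\gamma_t(i, k)$, and $\xi_t(i, j)$ but does not show the update formulas they lead to. For reference, the re-estimation formulas for a Gaussian-mixture HMM are usually written in the textbook form below. This is offered as a sketch of the standard result, not necessarily the exact form the authors used.

```latex
\begin{align*}
\pi_i   &= \gamma_1(i) \\
a_{ij}  &= \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \\
w_{ik}  &= \frac{\sum_{t=1}^{T} \gamma_t(i, k)}{\sum_{t=1}^{T} \gamma_t(i)} \\
\mu_{ik} &= \frac{\sum_{t=1}^{T} \gamma_t(i, k) \, o_t}{\sum_{t=1}^{T} \gamma_t(i, k)} \\
U_{ik}  &= \frac{\sum_{t=1}^{T} \gamma_t(i, k) \, (o_t - \mu_{ik})(o_t - \mu_{ik})^{\top}}{\sum_{t=1}^{T} \gamma_t(i, k)}
\end{align*}
```

Each update is a ratio of expected counts: the numerator accumulates the posterior weight of the event being modelled, and the denominator accumulates the posterior weight of the state or substate it is conditioned on.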



II. Method 

The system consists of two stages: the training stage and the synthesis stage. First, in the training stage, mel-cepstral
coefficients are obtained from the speech signal by mel-cepstral analysis, and their delta and delta-delta coefficients are
calculated. Then phoneme HMMs are trained using the mel-cepstral coefficients and their deltas and delta-deltas.

In the synthesis stage, an arbitrary given text to be synthesized is transformed into a phoneme sequence. According
to the phoneme sequence, a sentence HMM which represents the whole text to be synthesized is constructed by concatenating
phoneme HMMs. From the sentence HMM, a speech parameter sequence is generated using the algorithm for speech parameter
generation from HMMs. By using the Mel Log Spectrum Approximation (MLSA) filter, speech is synthesized from the generated
mel-cepstral coefficients.
Speech database

HMMs were trained using 503 phonetically balanced sentences uttered by a male speaker. The speech signal was sampled at
20 kHz, downsampled to 10 kHz, and re-labelled using the 60 phonemes and silence given in Table 1. Unvoiced vowels
with previous consonants are treated as individual phonemes, e.g. shi is composed of unvoiced i with previous sh.
Vowels: [not legible in the source]

Consonants: N, m, n, y, w, r, p, pp, t, tt, k, kk, b, d, dd,
g, ch, cch, ts, tts, s, ss, sh, ssh, h, f, ff, z,
j, my, ny, ry, by, gy, py, ppy, ky, kky, hy

Unvoiced vowels with previous consonants: pi, pu, ppi, ki, ku, kku, chi, cchi, tsu, su, shi,
shu, sshi, sshu, hi, fu

Table 1. Phonemes used in the system

[Figure 2: block diagram. Recoverable labels: SPEECH DATABASE, Sentence HMM, SYNTHESIZED SPEECH]

Figure 2. Block diagram of the HMM-based speech synthesis system.

Speech Analysis 

Speech signals are windowed by a 25.6 ms Blackman window with a 5 ms shift, and then mel-cepstral coefficients are obtained by
15th-order mel-cepstral analysis. The dynamic features $\Delta c_t$ and $\Delta^2 c_t$, i.e. the delta and delta-delta mel-cepstral coefficients at
frame $t$, are calculated as

$$\Delta c_t = \frac{1}{2} (c_{t+1} - c_{t-1}), \qquad \Delta^2 c_t = \frac{1}{2} (\Delta c_{t+1} - \Delta c_{t-1})$$

The feature vector is composed of 16 mel-cepstral coefficients, including the zeroth coefficient, and their delta and
delta-delta coefficients.
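The dynamic features can be computed with a short routine such as the one below. It is a minimal sketch: the text does not say how the first and last frames are handled, so clamping the frame index at the edges is an assumption, and the function names are hypothetical.

```python
def deltas(seq):
    """Central difference 0.5 * (x[t+1] - x[t-1]) for every frame and dimension.

    seq is a list of frames, each frame a list of floats.
    Frame indices are clamped at the sequence edges (an assumption).
    """
    T = len(seq)
    out = []
    for t in range(T):
        prev_frame = seq[max(t - 1, 0)]
        next_frame = seq[min(t + 1, T - 1)]
        out.append([0.5 * (n - p) for n, p in zip(next_frame, prev_frame)])
    return out


def add_dynamic_features(c):
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = deltas(c)       # delta coefficients
    d2 = deltas(d1)      # delta-delta coefficients, computed from the deltas
    return [ct + dt + ddt for ct, dt, ddt in zip(c, d1, d2)]


# Five frames of 16 static coefficients forming a linear ramp (made-up values).
statics = [[float(t)] * 16 for t in range(5)]
features = add_dynamic_features(statics)
```

With 16 static coefficients per frame, each feature vector has 48 elements. At the 10 kHz sampling rate, the 25.6 ms window spans 256 samples and the 5 ms shift spans 50 samples.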
Training of HMM

All HMMs used in the system were left-to-right models with no skips. Each state had a single Gaussian distribution with a
diagonal covariance matrix. Initially, a set of monophone models was trained. These models were cloned to produce triphone
models for all distinct triphones in the training data. The triphone models were re-estimated with the embedded version of
the Baum-Welch algorithm. All states at the same position of the triphone HMMs derived from the same monophone
HMM were clustered using the furthest-neighbour hierarchical clustering algorithm. The output distributions in the same
cluster were tied to reduce the number of parameters and to balance model complexity against the available data. The tied
triphone models were re-estimated with embedded training again.

Finally, the training data was aligned to the models via the Viterbi algorithm to obtain state duration densities. Each state duration
density was modelled by a single Gaussian distribution.
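The duration modelling step can be illustrated as follows. This sketch assumes the Viterbi alignment is available as one state label per frame and that a duration is counted in frames. The helper names and the example alignment are hypothetical.

```python
def state_durations(alignment):
    """Collect run lengths (in frames) per state from a frame-level alignment.

    alignment is a non-empty list with one state label per frame.
    """
    durations = {}
    run_state, run_len = alignment[0], 0
    for state in alignment:
        if state == run_state:
            run_len += 1
        else:
            durations.setdefault(run_state, []).append(run_len)
            run_state, run_len = state, 1
    durations.setdefault(run_state, []).append(run_len)
    return durations


def gaussian_stats(values):
    """Maximum likelihood mean and variance of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


# Made-up alignment: state s1 lasts 3 frames, s2 lasts 2 frames, s3 lasts 1 frame.
runs = state_durations(["s1", "s1", "s1", "s2", "s2", "s3"])
```

In practice the run lengths of each state would be pooled over all training utterances before `gaussian_stats` is applied.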
Speech Synthesis 

An arbitrary given text to be synthesized is converted into a phoneme sequence. Then the triphone HMMs corresponding to
the phoneme sequence are concatenated to obtain a sentence HMM which represents the whole text to be synthesized.
For triphones which did not exist in the training data, monophone models are used instead. From the sentence HMM, a
speech parameter sequence is generated using the parameter generation algorithm. By using the MLSA filter, speech is synthesized
directly from the generated mel-cepstral coefficients.
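The idea of generating a smooth parameter sequence from static and dynamic statistics can be sketched as follows. The sketch rests on several assumptions that the text does not state: the state sequence is already fixed, the feature is one-dimensional with static and delta components only, covariances are diagonal, and frame indices are clamped at the edges. Under these assumptions, writing the observations as $Wc$ for a window matrix $W$, maximizing the likelihood over the static sequence $c$ leads to the linear system $W^{\top} U^{-1} W c = W^{\top} U^{-1} \mu$, which the code solves directly. All function names and the example statistics are hypothetical.

```python
def window_matrix(T):
    """2T x T matrix W mapping statics c to interleaved rows [c_t, delta c_t]."""
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0                              # static row
        # delta c_t = 0.5 * (c[t+1] - c[t-1]); indices clamped at the edges (assumption)
        W[2 * t + 1][min(t + 1, T - 1)] += 0.5
        W[2 * t + 1][max(t - 1, 0)] -= 0.5
    return W


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def generate(means, variances):
    """means/variances: per-frame [static, delta] statistics of the chosen states."""
    T = len(means)
    W = window_matrix(T)
    mu = [m for frame in means for m in frame]
    prec = [1.0 / v for frame in variances for v in frame]
    rows = range(2 * T)
    R = [[sum(W[k][i] * prec[k] * W[k][j] for k in rows) for j in range(T)]
         for i in range(T)]
    r = [sum(W[k][i] * prec[k] * mu[k] for k in rows) for i in range(T)]
    return solve(R, r)


def log_likelihood(c, means, variances):
    """Log-likelihood of [c_t, delta c_t] up to an additive constant."""
    T = len(c)
    W = window_matrix(T)
    total = 0.0
    for k in range(2 * T):
        o_k = sum(W[k][j] * c[j] for j in range(T))
        total -= 0.5 * (o_k - means[k // 2][k % 2]) ** 2 / variances[k // 2][k % 2]
    return total


# Two made-up states of three frames each: static mean steps from 0 to 1, delta mean is 0.
means = [[0.0, 0.0]] * 3 + [[1.0, 0.0]] * 3
variances = [[1.0, 0.1]] * 6
c = generate(means, variances)
```

When the delta variances are made very large the dynamic constraint vanishes and the output falls back to the piecewise-constant static means. A dense solver is used here only for brevity.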
Subjective Experiments 

Subjective tests were conducted to evaluate the effect of including dynamic features and to investigate the
relationship between the number of states of tied triphone HMMs and the quality of synthesized speech. The test sentence set
consisted of twelve sentences which were not included in the training sentences. Fundamental frequency contours were
extracted from natural utterances and used for speech synthesis, with linear time warping within each phoneme to adjust
the phoneme durations of the extracted fundamental frequency contours to the generated parameter sequences. In the test sentence set,
there were 619 distinct triphones, of which 38 (5.8%) were not included in the training data and were replaced by
monophones. The test sentence set was divided into three sets, and each set was evaluated by individual subjects. Subjects
were presented with a pair of synthesized speech samples at each trial and asked to judge which of the two samples sounded better.
Effect of dynamic features 

To investigate the effect of dynamic features, a paired comparison test was conducted. Speech samples used in the
test were synthesized using

(1) spectral sequences generated without dynamic features from models trained using only static features,

(2) spectral sequences generated using only static features and then linearly interpolated between the centers of state
durations,

(3) spectral sequences generated using static and delta parameters from models trained using static and dynamic (delta)
parameters, and

(4) spectral sequences generated using static, delta, and delta-delta parameters from models trained using static, delta, and
delta-delta parameters.

All the models were triphone models without state tying.



[Figure 3: bar chart. Vertical axis: 0% to 100%]

Figure 3. Effects of dynamic features
Figure 3 shows the results of the paired comparison test. The vertical axis denotes the preference score. From the results, it can be
seen that the scores for synthetic speech generated using dynamic features are much higher than those for synthetic speech
generated using static features only, with or without linear interpolation. This is because, by exploiting the statistics of dynamic features
for speech parameter generation, the generated spectral sequences can appropriately reflect not only the shapes of spectra but also
their transitions, compared with spectral sequences generated using static features only with linear interpolation.
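The text does not define the preference score. One common reading, assumed here, is the fraction of paired trials involving a condition in which that condition was judged better. The sketch below uses that definition; the function name and the trial data are made up for illustration and are not results from the experiment.

```python
from collections import Counter


def preference_scores(trials):
    """trials: list of (condition_a, condition_b, winner) tuples.

    Returns, for each condition, wins divided by the number of trials it appeared in.
    """
    wins, seen = Counter(), Counter()
    for a, b, winner in trials:
        seen[a] += 1
        seen[b] += 1
        wins[winner] += 1
    return {cond: wins[cond] / seen[cond] for cond in seen}


# Made-up judgments between conditions (1) to (4); not data from the paper.
example = [(1, 4, 4), (2, 4, 4), (1, 2, 2), (3, 4, 3)]
scores = preference_scores(example)
```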



State tying 

To investigate the relationship between the total number of states of tied triphone HMMs and the quality of
synthesized speech, paired comparison tests were conducted using 3-state and 5-state tied triphone HMMs. By modifying the stop criterion
for state clustering, several sets of HMMs which had different numbers of states were prepared for the tests. For 3-state HMMs,
comparisons were performed using triphone models without state tying (10,544 states in total), tied triphone models with
1,961 and 1,222 states in total, and monophone models (183 states); for 5-state HMMs, triphone models without state tying
(17,590 states in total), tied triphone models with 2,040 and 1,199 states in total, and monophone models (305 states). Note
that the state duration distributions of the triphone models were also used for the monophone models to avoid the influence of phoneme duration
on speech quality.
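The monophone state counts quoted above can be checked against the phoneme inventory. The short calculation below assumes one model for each of the 60 phonemes plus one silence model, all with the same number of states. It also computes the fraction of the untied triphone states that remains after tying, using only the figures quoted in the text.

```python
n_models = 60 + 1  # 60 phonemes plus one silence model (Table 1); assumption: one HMM each

# Monophone state totals for 3-state and 5-state models.
monophone_states = {k: n_models * k for k in (3, 5)}  # {3: 183, 5: 305}

# State totals quoted in the text.
untied = {3: 10544, 5: 17590}
tied = {3: (1961, 1222), 5: (2040, 1199)}

# Fraction of the untied state inventory kept after tying.
kept = {k: [n / untied[k] for n in tied[k]] for k in tied}
```

The products 61 x 3 = 183 and 61 x 5 = 305 match the monophone figures given in the text.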




Figure 4a. 3-state HMMs
Figure 4b. 5-state HMMs



Figures 4a and 4b show the results for 3-state and 5-state HMMs, respectively. From the results, it can be seen that the quality of
synthetic speech degrades as the number of states decreases. From informal listening tests and investigation of the generated
spectra, it was observed that the shapes of the spectra became flatter as the number of states decreased, and this caused
degradation in intelligibility. It was also observed that audible discontinuity in synthetic speech increased as the
number of states increased, whereas the generated spectra varied smoothly when the number of states was small. This
discontinuity resulted in the lower score for 5-state triphone models compared with 3-state triphone models. Note that
significant degradation in communicability was not observed even when the monophone models were used for speech
synthesis.
Figure 5. Comparison of 3-state and 5-state HMMs

Along the x-axis:

1 3-state HMM (1,961 states)

2 3-state HMM (1,222 states)

3 5-state HMM (2,040 states)

4 5-state HMM (1,199 states)

From the above figure, it can be seen that the scores for 3-state and 5-state models were almost equivalent when the number of
states was about 2,000, whereas the score of the 5-state models was better when the number of states was about 1,200. When the total number
of tied states is almost the same, the 5-state model has higher resolution in time than the 3-state model; conversely, the 3-state model
has better resolution in parameter space than the 5-state model. From this result, if the total number of states is limited, models
with higher resolution in time can synthesize more natural-sounding speech than models with higher resolution in parameter space.



III. Conclusion 

In the parameter generation algorithm, a speech parameter sequence is obtained so that the likelihood of the HMM for the
generated parameter sequence is maximized. By exploiting the constraints between static and dynamic features, the
generated parameter sequence appropriately reflects not only the statistics of the shapes of spectra but also those of their transitions obtained
from the training data, resulting in smooth and realistic spectral sequences. In the parameter generation algorithm, the problem of
generating speech parameters was simplified by assuming that the parameter sequence is generated along a single path. The
extended parameter generation algorithm using multi-mixture HMMs has more ability to generate natural-sounding speech;
however, it has higher computational complexity since it is based on the expectation-maximization algorithm,
which results in iterations of the forward-backward algorithm and the parameter generation algorithm.